Du arbeitest als Data Scientist bei einer P-2-P-Plattform, die vor einem Jahr gegründet wurde. Nun wollt ihr euer Geschäft erweitern. Euer Team hat sich aufgeteilt und jeder Analyst hat einen Teilbereich der Daten. Deine Aufgabe ist es in einer explorativen Datenanalyse Insights für eure Plattform herauszufinden.
Euer Geschäftsmodell ist das Betreiben einer Plattform (crowd-investing) bei der sich Personen die eine Geschäftsidee haben, aber nicht das benötigte Geld, anmelden und für ihr Projekt innerhalb einer vorgegebenen Zeit Geld sammeln können. Auf der anderen Seite habt ihr Geldgeber, die gern ihr Geld in Projekte anlegen möchten und nach Investitionen suchen. Als Vermittler bringt eure Plattform also Geldnehmer und Geldgeber zusammen. Ihr verdient euer Geld mit einer Provision für jedes Projekt was auf eurer Plattform landet. Das Rückzahlungsrisiko wird über den Zinssatz abgebildet.
Deine Datenbasis ist die Historie eurer Plattform. Alle Projekte sind abgeschlossene Projekte, d.h. die Zeit, um für sein Projekt Geld zu sammeln ist abgelaufen. Euer Geschäftsmodell sieht es vor, dass die gesammelten Gelder ausgezahlt werden, auch wenn der Zielbetrag nicht erreicht wurde.
Dafür ist der Datensatz zunächst vorzubereiten. In die Bewertung geht zudem die Dokumentation mit ein. Für die Visualisierung sind 3 verschiedenartige Plots zu benennen, die in die Bewertung eingehen sollen. Das bedeutet nicht, dass du nur 3 Plots ersellen darfst - ausschließlich für die Bewertung sind drei verschiedenartige Plots zu benennen! Weiterhin sollen es verschiedenartige Plots sein! Das bedeuted, ein Balkendiagramm und ein grouped Balkendiagramm sind am Ende des Tages Balkendiagramme. Im Abschlussprojekt zeigst du nicht, dass du eine Art von Plot in verschiedenen Varianten darstellen kannst, sondern, dass du verschiedene Plots kannst.
Vergiss dabei nicht die Erkenntnisse die aus jedem Plot gezogen werden können, zu notieren.
Die Dokumentation kann in englisch oder deutsch erfolgen.
Am Freitag erfolgt die Vorstellung der Plots innerhalb des Zeitrahmens von 5min - (+-1min) ist dabei ok.
Treff ist Donnerstag um 14:30 Uhr im Hauptraum. Da werden wir die Unterlagen in dokumentensichere Formate umwandeln. Das machen wir aber zusammen. Danach habt ihr bis 15:35 Uhr zeit, eure zu bewertenden Unterlagen in den Projektabgabeordner hochzuladen.
Der gesplittete Datensatz enthält folgende Spalten (inkl. Bedeutung):
- funded_amount ... mit Ablauf der "Crowding"zeit erhaltener Betrag/ ausgezahlter Betrag in USD
- loan_amount ... Zielbetrag (Betrag dem man für das Projekt erreichen wollte) in USD
- activity ... Unterkategory zu dem das Ziel des Crowdprojektes thematisch gehört
- sector ... Oberkategory in den das Crowdprojektes Thema fällt
- use ... Kurzbeschreibung wofür das Geld verwendet werden soll
- country_code ... Ländercode nach ISO Norm
- country ... Ländername nach ISO Norm
- region ... Region
- currency ... Währung in den der funded_amount dann ausgezahlt wurde
- term in months ... Dauer über die der Kredit ausgezahlt werden soll
- lender_count ...Darlehensgeber (also wieviele Personen Geld für das Projekt gegeben haben)
- borrower_genders ... Geschlecht und Anzahl der Darlehensnehmer, also diejenigen die das Crowdprojekt initiiert haben
- repayment interval ... vertraglich vereinbarte Rückzahlungsmodalitäten/-rhythmus
to do:
1.1) Datensatz einlesen
1.2) fehlende Werte
1.3) Ausreißer
1.4) Pairplot - erste visuelle Inspektion
1.5) Features
1.6) Speicherplatz
# benötigte bib importieren
import pandas as pd
import numpy as np
from collections import Counter
# benötigte bib für Visualisierung
import seaborn as sns
import plotly.io as pio
import plotly.express as px
pio.renderers.default = "notebook"
# csv einlesen und vorher trennzeichen identifizieren
file_1 = "your_part1.csv"
df1 = pd.read_csv(file_1,
sep="/n",
engine='python',
nrows=2)
df1
# , ist trennzeichen
| , funded_amount, loan_amount, activity, sector, use, country_code, country, region, currency, term_in_months, lender_count, borrower_genders, repayment_interval | |
|---|---|
| 0 | 0,300.0,300.0,Fruits & Vegetables,Food,"To buy... |
| 1 | 1,575.0,575.0,Rickshaw,Transportation,to repai... |
# df mit korrektem trennzeichen einlesen
df1 = pd.read_csv(file_1,
sep=",",
index_col=0)
df1.head(3)
# parameter index_col=0 führt dazu, dass die erste spalte (hier der index) als Index verwendet wird und somit nicht
# als "Unnamed 0" spalte auftaucht
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
# csv einlesen und vorher trennzeichen identifizieren
file_2 = "your_part2.csv"
df2 = pd.read_csv(file_2,
sep="/n",
engine='python',
nrows=2)
df2
# # ist trennzeichen
| # funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval | |
|---|---|
| 0 | 0#175.0#175.0#Liquor Store / Off-License#Food#... |
| 1 | 1#325.0#325.0#Livestock#Agriculture#to buy 3 z... |
df2 = pd.read_csv(file_2,
sep="#",
index_col=0)
df2.head(3)
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 175.0 | 175.0 | Liquor Store / Off-License | Food | to purchase additional stock of coconut wine t... | PH | Philippines | Palo, Leyte | PHP | 8.0 | 6 | female | irregular |
| 1 | 325.0 | 325.0 | Livestock | Agriculture | to buy 3 zebus and food to fatten them up. | MG | Madagascar | Antsirabe | MGA | 12.0 | 13 | female | monthly |
| 2 | 550.0 | 550.0 | Food Stall | Food | to buy ingredients for her food-vending busine... | PH | Philippines | Cordova, Cebu | PHP | 5.0 | 6 | female | irregular |
df1.columns
Index([' funded_amount', ' loan_amount', ' activity', ' sector', ' use',
' country_code', ' country', ' region', ' currency', ' term_in_months',
' lender_count', ' borrower_genders', ' repayment_interval'],
dtype='object')
df2.columns
Index([' funded_amount', ' loan_amount', ' activity', ' sector', ' use',
' country_code', ' country', ' region', ' currency', ' term_in_months',
' lender_count', ' borrower_genders', ' repayment_interval'],
dtype='object')
All the column names are same so we should join dataframes by index.
df = pd.concat([df1, df2], axis= 0).reset_index(drop=True)
df.head(3)
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
df.shape
(671205, 13)
df.columns
Index([' funded_amount', ' loan_amount', ' activity', ' sector', ' use',
' country_code', ' country', ' region', ' currency', ' term_in_months',
' lender_count', ' borrower_genders', ' repayment_interval'],
dtype='object')
Spaces in column names shuold be deleted
df.columns = df.columns.str.strip()
df.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
'country_code', 'country', 'region', 'currency', 'term_in_months',
'lender_count', 'borrower_genders', 'repayment_interval'],
dtype='object')
Check for duplicates
df.loc[df.duplicated()==True,:].head(3)
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 498 | 100.0 | 100.0 | Home Energy | Personal Use | to buy a solar-powered lamp. | SV | El Salvador | NaN | USD | 14.0 | 4 | male | monthly |
| 606 | 100.0 | 100.0 | Home Energy | Personal Use | to buy a solar-powered lamp. | SV | El Salvador | NaN | USD | 14.0 | 4 | male | monthly |
| 808 | 450.0 | 450.0 | Higher education costs | Education | to pay for one semester's registration fees. | CO | Colombia | Bogotà | COP | 7.0 | 15 | female | monthly |
# df.drop(index=df.loc[df.duplicated() == True].index, inplace=True)
df.drop_duplicates(inplace=True, ignore_index=True)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 646833 entries, 0 to 646832 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 646833 non-null float64 1 loan_amount 646833 non-null float64 2 activity 646833 non-null object 3 sector 646833 non-null object 4 use 642934 non-null object 5 country_code 646825 non-null object 6 country 646833 non-null object 7 region 590670 non-null object 8 currency 646833 non-null object 9 term_in_months 646833 non-null float64 10 lender_count 646833 non-null int64 11 borrower_genders 642945 non-null object 12 repayment_interval 646833 non-null object dtypes: float64(3), int64(1), object(9) memory usage: 64.2+ MB
First finfings (Missing values):
# Check activity for the strange value
df.activity.unique()
array(['Fruits & Vegetables', 'Rickshaw', 'Transportation', 'Embroidery',
'Milk Sales', 'Services', 'Dairy', 'Beauty Salon', 'Manufacturing',
'Food Production/Sales', 'Wholesale', 'General Store',
'Clothing Sales', 'Poultry', 'Tailoring', 'Sewing', 'Bakery',
'Restaurant', 'Food Stall', 'Farming', 'Construction Supplies',
'Personal Products Sales', 'Home Products Sales',
'Natural Medicines', 'Fish Selling', 'Education provider',
'Shoe Sales', 'Machinery Rental', 'Butcher Shop', 'Pigs',
'Personal Expenses', 'Food Market', 'Cosmetics Sales',
'Personal Housing Expenses', 'Retail', 'Energy', 'Grocery Store',
'Construction', 'Agriculture', 'Motorcycle Transport',
'Charcoal Sales', 'Food', 'Pharmacy', 'Fishing', 'Timber Sales',
'Cattle', 'Electronics Repair', 'Electronics Sales', 'Vehicle',
'Cafe', 'Blacksmith', 'Higher education costs', 'Used Clothing',
'Fuel/Firewood', 'Upholstery', 'Catering', 'Animal Sales',
'Cereals', 'Vehicle Repairs', 'Arts',
'Cloth & Dressmaking Supplies', 'Mobile Phones', 'Spare Parts',
'Clothing', 'Metal Shop', 'Barber Shop', 'Furniture Making',
'Crafts', 'Home Energy', 'Home Appliances', 'Wedding Expenses',
'Taxi', 'Secretarial Services', 'Livestock', 'Property',
'Recycling', 'Farm Supplies', 'Auto Repair', 'Beverages',
'Plastics Sales', 'Electrical Goods', 'Carpentry', 'Photography',
'Jewelry', 'Bricks', 'Pub', 'Phone Use Sales',
'Water Distribution', 'Paper Sales', 'Computers',
'Liquor Store / Off-License', 'Utilities', 'Knitting', 'Weaving',
'Party Supplies', 'Medical Clinic', 'Internet Cafe',
'Consumer Goods', 'Cement', 'Electrician',
'Primary/secondary school costs', 'Veterinary Sales',
'Land Rental', 'Laundry', 'Call Center', 'Perfumes', 'Hotel',
'Motorcycle Repair', 'Movie Tapes & DVDs', 'Quarrying',
'Personal Medical Expenses', 'Bookstore', 'Decorations Sales',
'Recycled Materials', 'Office Supplies', 'Souvenir Sales',
'Renewable Energy Products', 'Health', 'Printing', 'Phone Repair',
'Traveling Sales', 'Flowers', 'Bicycle Repair', 'Entertainment',
'Phone Accessories', 'Hardware', 'Used Shoes',
'Music Discs & Tapes', 'Games', 'Balut-Making', 'Textiles',
'Child Care', 'Goods Distribution', 'Florist', 'Cobbler', 'Dental',
'Bookbinding', 'Cheese Making', 'Bicycle Sales', 'Well digging',
'Technology', 'Musical Performance', 'Waste Management', 'Film',
'Tourism', 'Musical Instruments', 'Religious Articles',
'Machine Shop', 'Cleaning Services', 'Sporting Good Sales',
'Patchwork', 'Funerals', 'Air Conditioning', 'Communications',
'Adult Care', 'Landscaping / Gardening', 'Aquaculture',
'Beekeeping', 'Event Planning', 'Celebrations', 'Computer',
'Personal Care Products', 'Mobile Transactions'], dtype=object)
# Check sector for the strange value
df.sector.unique()
array(['Food', 'Transportation', 'Arts', 'Services', 'Agriculture',
'Manufacturing', 'Wholesale', 'Retail', 'Clothing', 'Construction',
'Health', 'Education', 'Personal Use', 'Housing', 'Entertainment'],
dtype=object)
df.currency.unique()
array(['PKR', 'INR', 'KES', 'NIO', 'USD', 'TZS', 'PHP', 'PEN', 'XOF',
'LRD', 'VND', 'HNL', 'MNT', 'COP', 'GTQ', 'TJS', 'BOB', 'YER',
'KHR', 'GHS', 'SLL', 'HTG', 'CLP', 'JOD', 'UGX', 'BIF', 'IDR',
'GEL', 'UAH', 'EUR', 'ALL', 'CRC', 'XAF', 'TRY', 'AZN', 'DOP',
'BRL', 'MXN', 'KGS', 'AMD', 'PYG', 'LBP', 'WST', 'ILS', 'RWF',
'ZMW', 'NPR', 'MZN', 'ZAR', 'BZD', 'SRD', 'NGN', 'VUV', 'XCD',
'MWK', 'LAK', 'MMK', 'ZWD', 'MDL', 'SSP', 'SBD', 'CNY', 'EGP',
'MGA', 'NAD', 'LSL', 'THB'], dtype=object)
df.borrower_genders.unique()
array(['female', 'female, female', 'female, female, female', ...,
'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female',
'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male',
'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'],
dtype=object)
df.repayment_interval.unique()
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)
# Check the nulls of columns
df.isnull().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 use 3899 country_code 8 country 0 region 56163 currency 0 term_in_months 0 lender_count 0 borrower_genders 3888 repayment_interval 0 dtype: int64
# Check all nulls happens at the same time
df.loc[(df.borrower_genders.isnull()) & (df.region.isnull()) & (df.region.isnull()),:].count()
funded_amount 3888 loan_amount 3888 activity 3888 sector 3888 use 0 country_code 3888 country 3888 region 0 currency 3888 term_in_months 3888 lender_count 3888 borrower_genders 0 repayment_interval 3888 dtype: int64
NaN values: So what we see here is Region column has too many NaNs and not good, so we should check if we can delete this column. The null values in other columns can be deleted as rows The null values in Country code should be checked.
df.loc[df.region.isna() == True,:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 250.0 | 250.0 | Services | Services | purchase leather for my business using ksh 20000. | KE | Kenya | NaN | KES | 4.0 | 6 | female | irregular |
| 49 | 450.0 | 450.0 | General Store | Retail | to stock his store. | SV | El Salvador | NaN | USD | 14.0 | 18 | male | monthly |
| 54 | 225.0 | 225.0 | Food Market | Food | to purchase various seasonal items to resell: ... | SN | Senegal | NaN | XOF | 14.0 | 7 | female | monthly |
| 67 | 125.0 | 125.0 | Energy | Services | purchase solar lanterns for resale. | KE | Kenya | NaN | KES | 3.0 | 6 | male | irregular |
| 70 | 2000.0 | 2000.0 | Retail | Retail | to install a display window and a sunshade for... | IQ | Iraq | NaN | USD | 15.0 | 71 | male | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646822 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
| 646823 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
| 646824 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 646825 | 0.0 | 25.0 | Games | Entertainment | Kiva Coordinator replaced loan use. Should see... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
| 646826 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
56163 rows × 13 columns
df.loc[df.borrower_genders.isna() == True,:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 140 | 2975.0 | 2975.0 | Food Production/Sales | Food | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 110 | NaN | monthly |
| 145 | 1200.0 | 1200.0 | Personal Expenses | Personal Use | NaN | PE | Peru | NaN | PEN | 20.0 | 44 | NaN | monthly |
| 170 | 4250.0 | 4250.0 | Catering | Food | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 116 | NaN | monthly |
| 412 | 2350.0 | 2350.0 | Beauty Salon | Services | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 75 | NaN | monthly |
| 414 | 725.0 | 725.0 | Agriculture | Agriculture | NaN | SV | El Salvador | NaN | USD | 20.0 | 19 | NaN | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 636081 | 5625.0 | 10000.0 | Weaving | Arts | NaN | BT | Bhutan | NaN | USD | 14.0 | 210 | NaN | irregular |
| 637157 | 1975.0 | 1975.0 | Home Energy | Personal Use | NaN | PS | Palestine | NaN | USD | 27.0 | 39 | NaN | monthly |
| 637995 | 800.0 | 1600.0 | Furniture Making | Manufacturing | NaN | HT | Haiti | NaN | HTG | 13.0 | 27 | NaN | irregular |
| 646812 | 0.0 | 25.0 | Livestock | Agriculture | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
| 646826 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
3888 rows × 13 columns
# Check the numerical columns for strange words
cols = ["funded_amount", "loan_amount", "term_in_months", "lender_count"]
for x in cols:
signs = []
for element in df.loc[:, x]:
try:
float(element)
except:
signs.append(element)
print(f'{x}: {signs}')
funded_amount: [] loan_amount: [] term_in_months: [] lender_count: []
df.loc[df.region.isnull(),:].head(5)
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 250.0 | 250.0 | Services | Services | purchase leather for my business using ksh 20000. | KE | Kenya | NaN | KES | 4.0 | 6 | female | irregular |
| 49 | 450.0 | 450.0 | General Store | Retail | to stock his store. | SV | El Salvador | NaN | USD | 14.0 | 18 | male | monthly |
| 54 | 225.0 | 225.0 | Food Market | Food | to purchase various seasonal items to resell: ... | SN | Senegal | NaN | XOF | 14.0 | 7 | female | monthly |
| 67 | 125.0 | 125.0 | Energy | Services | purchase solar lanterns for resale. | KE | Kenya | NaN | KES | 3.0 | 6 | male | irregular |
| 70 | 2000.0 | 2000.0 | Retail | Retail | to install a display window and a sunshade for... | IQ | Iraq | NaN | USD | 15.0 | 71 | male | monthly |
df.groupby(["country", "region"]).agg(nof_cr=("region", "size"))
| nof_cr | ||
|---|---|---|
| country | region | |
| Afghanistan | Kandahar City | 2 |
| Albania | Cerrik | 1 |
| Elbasan | 243 | |
| Korce | 753 | |
| Lac | 122 | |
| ... | ... | ... |
| Zimbabwe | Shurugwi | 120 |
| Umguza | 54 | |
| Umzingwane | 43 | |
| mwenezi | 106 | |
| wedza | 109 |
12749 rows × 1 columns
# To make sure region data compare null regions to not null regions
df_null_region = df.loc[df.region.isnull(),["country"]]
df_null_region = df_null_region.groupby(["country"]).agg(null_regions = ("country", "size"))
df_notnull_region = df.loc[df.region.isna() == False,["country"]]
df_notnull_region = df_notnull_region.groupby(["country"]).agg(notnull_regions = ("country", "size"))
df_null_region_compare = df_notnull_region.merge(df_null_region, how="left", on="country")
df_null_region_compare
| notnull_regions | null_regions | |
|---|---|---|
| country | ||
| Afghanistan | 2 | NaN |
| Albania | 1619 | 315.0 |
| Armenia | 8624 | 5.0 |
| Azerbaijan | 1772 | 170.0 |
| Belize | 122 | 2.0 |
| ... | ... | ... |
| Vanuatu | 4 | NaN |
| Vietnam | 9531 | 3.0 |
| Yemen | 2308 | 5.0 |
| Zambia | 747 | 28.0 |
| Zimbabwe | 3921 | 52.0 |
82 rows × 2 columns
# As a sample lets check USA as country
df_null_region_compare.index
Index(['Afghanistan', 'Albania', 'Armenia', 'Azerbaijan', 'Belize', 'Benin',
'Bhutan', 'Bolivia', 'Brazil', 'Burkina Faso', 'Burundi', 'Cambodia',
'Cameroon', 'Chile', 'China', 'Colombia', 'Congo', 'Costa Rica',
'Cote D'Ivoire', 'Dominican Republic', 'Ecuador', 'Egypt',
'El Salvador', 'Georgia', 'Ghana', 'Guatemala', 'Haiti', 'Honduras',
'India', 'Indonesia', 'Israel', 'Jordan', 'Kenya', 'Kyrgyzstan',
'Lao People's Democratic Republic', 'Lebanon', 'Lesotho', 'Liberia',
'Madagascar', 'Malawi', 'Mali', 'Mauritania', 'Mexico', 'Moldova',
'Mongolia', 'Mozambique', 'Myanmar (Burma)', 'Namibia', 'Nepal',
'Nicaragua', 'Nigeria', 'Pakistan', 'Palestine', 'Panama', 'Paraguay',
'Peru', 'Philippines', 'Rwanda', 'Saint Vincent and the Grenadines',
'Samoa', 'Senegal', 'Sierra Leone', 'Solomon Islands', 'Somalia',
'South Africa', 'South Sudan', 'Suriname', 'Tajikistan', 'Tanzania',
'Thailand', 'The Democratic Republic of the Congo', 'Timor-Leste',
'Togo', 'Turkey', 'Uganda', 'Ukraine', 'United States', 'Vanuatu',
'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
dtype='object', name='country')
df_null_region_compare.loc[df_null_region_compare.index == "United States",:]
| notnull_regions | null_regions | |
|---|---|---|
| country | ||
| United States | 920 | 5172.0 |
As a sample for United States there are 5172 NaN rows versus 920 normal values. This shows that this column has no reliable data and can be deleted.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 646833 entries, 0 to 646832 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 646833 non-null float64 1 loan_amount 646833 non-null float64 2 activity 646833 non-null object 3 sector 646833 non-null object 4 use 642934 non-null object 5 country_code 646825 non-null object 6 country 646833 non-null object 7 region 590670 non-null object 8 currency 646833 non-null object 9 term_in_months 646833 non-null float64 10 lender_count 646833 non-null int64 11 borrower_genders 642945 non-null object 12 repayment_interval 646833 non-null object dtypes: float64(3), int64(1), object(9) memory usage: 64.2+ MB
# Delete Region column
df.drop(columns="region", axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 646833 entries, 0 to 646832 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 646833 non-null float64 1 loan_amount 646833 non-null float64 2 activity 646833 non-null object 3 sector 646833 non-null object 4 use 642934 non-null object 5 country_code 646825 non-null object 6 country 646833 non-null object 7 currency 646833 non-null object 8 term_in_months 646833 non-null float64 9 lender_count 646833 non-null int64 10 borrower_genders 642945 non-null object 11 repayment_interval 646833 non-null object dtypes: float64(3), int64(1), object(8) memory usage: 59.2+ MB
Here the values can not be completed with something like median or andere filling values. That doesn't make sense. In addition these Nan rows exist in both columns. The best option should be deleting the rows. 3900 rows against 640000 rows is not a big issue. But it can make other problems, some countries that falls completely in these rows might lose its representation.
# Delete Nan Rows in Use and Borrower_Gender columns
df.dropna(subset=["use", "borrower_genders"], inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 642934 entries, 0 to 646832 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 642934 non-null float64 1 loan_amount 642934 non-null float64 2 activity 642934 non-null object 3 sector 642934 non-null object 4 use 642934 non-null object 5 country_code 642926 non-null object 6 country 642934 non-null object 7 currency 642934 non-null object 8 term_in_months 642934 non-null float64 9 lender_count 642934 non-null int64 10 borrower_genders 642934 non-null object 11 repayment_interval 642934 non-null object dtypes: float64(3), int64(1), object(8) memory usage: 63.8+ MB
df.isnull().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 use 0 country_code 8 country 0 currency 0 term_in_months 0 lender_count 0 borrower_genders 0 repayment_interval 0 dtype: int64
# Lets check country code
df.loc[df.country_code.isnull(),:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 198369 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | NAD | 6.0 | 162 | female | bullet |
| 198646 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | NAD | 6.0 | 159 | male | bullet |
| 337622 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | NAD | 7.0 | 120 | female | bullet |
| 343771 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | NAD | 7.0 | 126 | male | bullet |
| 411637 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | NAD | 7.0 | 118 | female | bullet |
| 411895 | 4000.0 | 4000.0 | Wholesale | Wholesale | purchase solar lighting products for sale to l... | NaN | Namibia | NAD | 7.0 | 150 | male | bullet |
| 474672 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | to pay for stock of solar lights and cell phon... | NaN | Namibia | NAD | 7.0 | 183 | male | bullet |
| 475099 | 5000.0 | 5000.0 | Wholesale | Wholesale | to maintain a stock of solar lights and cell p... | NaN | Namibia | NAD | 7.0 | 183 | female | bullet |
That can be the reason system understood the Namibia s country code "NA" as null.
# We make changes
df["country_code"] = df["country_code"].fillna("NA")
df.loc[df.country_code == "NA",:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 198369 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | NAD | 6.0 | 162 | female | bullet |
| 198646 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | NAD | 6.0 | 159 | male | bullet |
| 337622 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | NAD | 7.0 | 120 | female | bullet |
| 343771 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | NAD | 7.0 | 126 | male | bullet |
| 411637 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | NAD | 7.0 | 118 | female | bullet |
| 411895 | 4000.0 | 4000.0 | Wholesale | Wholesale | purchase solar lighting products for sale to l... | NA | Namibia | NAD | 7.0 | 150 | male | bullet |
| 474672 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | to pay for stock of solar lights and cell phon... | NA | Namibia | NAD | 7.0 | 183 | male | bullet |
| 475099 | 5000.0 | 5000.0 | Wholesale | Wholesale | to maintain a stock of solar lights and cell p... | NA | Namibia | NAD | 7.0 | 183 | female | bullet |
df.isnull().sum()
funded_amount 0 loan_amount 0 activity 0 sector 0 use 0 country_code 0 country 0 currency 0 term_in_months 0 lender_count 0 borrower_genders 0 repayment_interval 0 dtype: int64
# Check if "term_in_months" should be converted to integer
cols = ["term_in_months"]
for x in cols:
signs1 = set()
for element in df.loc[:,x]:
if element % 1 > 0:
signs1.add(element)
print(f'{x}: {signs1}')
term_in_months: set()
df.term_in_months = pd.to_numeric(df.term_in_months, downcast="integer")
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 642934 entries, 0 to 646832 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 642934 non-null float64 1 loan_amount 642934 non-null float64 2 activity 642934 non-null object 3 sector 642934 non-null object 4 use 642934 non-null object 5 country_code 642934 non-null object 6 country 642934 non-null object 7 currency 642934 non-null object 8 term_in_months 642934 non-null int16 9 lender_count 642934 non-null int64 10 borrower_genders 642934 non-null object 11 repayment_interval 642934 non-null object dtypes: float64(2), int16(1), int64(1), object(8) memory usage: 60.1+ MB
fig = px.scatter(data_frame=df, x="funded_amount", y="loan_amount")
fig.show()
sns.pairplot(data=df,corner=True)
<seaborn.axisgrid.PairGrid at 0x22a010bba00>
Findings (Pairplot):
df.loc[df.loan_amount == 100000,:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69745 | 100000.0 | 100000.0 | Agriculture | Agriculture | create more than 300 jobs for women and farmer... | HT | Haiti | USD | 75 | 2986 | female | irregular |
so the data is reasonable and won't make harm to our report.
We saw that most initiaters are women. And that can say something important to us. That's way we have decided to show the borrower_genders column in a better way. Existing structure of the column is not usable. For this we decided to make two new column from this column showing the number of male, female initiaters.
# New columns for borrower genders
df["borrower_male_count"] = df.borrower_genders.map(lambda x: [i.strip() for i in x.split(",")]) \
.map(lambda x: Counter(x)).map(lambda x: x["male"])
df["borrower_female_count"] = df.borrower_genders.map(lambda x: [i.strip() for i in x.split(",")]) \
.map(lambda x: Counter(x)).map(lambda x: x["female"])
df.loc[45:55,:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | currency | term_in_months | lender_count | borrower_genders | repayment_interval | borrower_male_count | borrower_female_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | 150.0 | 150.0 | Farming | Agriculture | to purchase potato seeds and fertilizers for c... | IN | India | INR | 43 | 6 | female | bullet | 0 | 1 |
| 46 | 250.0 | 250.0 | Fruits & Vegetables | Food | to purchase sacks of tomatoes, potatoes, fruit... | KE | Kenya | KES | 13 | 7 | female | irregular | 0 | 1 |
| 47 | 250.0 | 250.0 | Farming | Agriculture | to purchase potato seeds and fertilizers for g... | IN | India | INR | 43 | 10 | female | bullet | 0 | 1 |
| 48 | 600.0 | 600.0 | Machinery Rental | Services | to invest in working capital and to maintain g... | NI | Nicaragua | NIO | 14 | 16 | female | monthly | 0 | 1 |
| 49 | 450.0 | 450.0 | General Store | Retail | to stock his store. | SV | El Salvador | USD | 14 | 18 | male | monthly | 1 | 0 |
| 50 | 3175.0 | 3175.0 | Butcher Shop | Food | to buy meat and also to start selling fish in ... | TZ | Tanzania | TZS | 10 | 93 | male, male, male, male, male | monthly | 5 | 0 |
| 51 | 175.0 | 175.0 | Pigs | Agriculture | to buy piglets and feed | PH | Philippines | PHP | 8 | 6 | female | irregular | 0 | 1 |
| 52 | 175.0 | 175.0 | Pigs | Agriculture | to purchase feed and vitamins for her pigs | PH | Philippines | PHP | 8 | 7 | female | irregular | 0 | 1 |
| 53 | 550.0 | 550.0 | Personal Expenses | Personal Use | to buy a cradle and household items for his yo... | PE | Peru | PEN | 14 | 20 | male | monthly | 1 | 0 |
| 54 | 225.0 | 225.0 | Food Market | Food | to purchase various seasonal items to resell: ... | SN | Senegal | XOF | 14 | 7 | female | monthly | 0 | 1 |
| 55 | 700.0 | 700.0 | General Store | Retail | to buy additional items like eggs, charcoal, r... | PH | Philippines | PHP | 5 | 2 | female | irregular | 0 | 1 |
df.drop(columns=["borrower_genders"], axis=1, inplace=True)
df.head(5)
| funded_amount | loan_amount | activity | sector | use | country_code | country | currency | term_in_months | lender_count | repayment_interval | borrower_male_count | borrower_female_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | PKR | 12 | 12 | irregular | 0 | 1 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | PKR | 11 | 14 | irregular | 0 | 2 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | INR | 43 | 6 | bullet | 0 | 1 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | PKR | 11 | 8 | irregular | 0 | 1 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | PKR | 14 | 16 | monthly | 0 | 1 |
#Correlation Matrix
import matplotlib.pyplot as plt
corr = df.corr()
plt.figure(figsize=(12,12))
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, annot=True,
cmap = 'coolwarm', square=True)
plt.title('Correlation between different features')
corr
| funded_amount | loan_amount | term_in_months | lender_count | borrower_male_count | borrower_female_count | |
|---|---|---|---|---|---|---|
| funded_amount | 1.000000 | 0.951415 | 0.153453 | 0.846736 | 0.197196 | 0.471324 |
| loan_amount | 0.951415 | 1.000000 | 0.190599 | 0.801527 | 0.209109 | 0.446061 |
| term_in_months | 0.153453 | 0.190599 | 1.000000 | 0.233387 | -0.028207 | -0.154234 |
| lender_count | 0.846736 | 0.801527 | 0.233387 | 1.000000 | 0.170906 | 0.264898 |
| borrower_male_count | 0.197196 | 0.209109 | -0.028207 | 0.170906 | 1.000000 | 0.173048 |
| borrower_female_count | 0.471324 | 0.446061 | -0.154234 | 0.264898 | 0.173048 | 1.000000 |
Findings (Correlation Matrix):
After checking the Correlation Matrix, I saw that Lender count has dependancy between funded amount. So funding amount for each lender can say something to us at least: How much do people generally make funding in an event?
df["funding_pro_lender"] = df.funded_amount / df.lender_count
df.head(5)
| funded_amount | loan_amount | activity | sector | use | country_code | country | currency | term_in_months | lender_count | repayment_interval | borrower_male_count | borrower_female_count | funding_pro_lender | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | PKR | 12 | 12 | irregular | 0 | 1 | 25.000000 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | PKR | 11 | 14 | irregular | 0 | 2 | 41.071429 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | INR | 43 | 6 | bullet | 0 | 1 | 25.000000 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | PKR | 11 | 8 | irregular | 0 | 1 | 25.000000 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | PKR | 14 | 16 | monthly | 0 | 1 | 25.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 642934 entries, 0 to 646832 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 642934 non-null float64 1 loan_amount 642934 non-null float64 2 activity 642934 non-null object 3 sector 642934 non-null object 4 use 642934 non-null object 5 country_code 642934 non-null object 6 country 642934 non-null object 7 currency 642934 non-null object 8 term_in_months 642934 non-null int16 9 lender_count 642934 non-null int64 10 repayment_interval 642934 non-null object 11 borrower_male_count 642934 non-null int64 12 borrower_female_count 642934 non-null int64 13 funding_pro_lender 639771 non-null float64 dtypes: float64(3), int16(1), int64(3), object(7) memory usage: 86.0+ MB
I think making median by countries seems more sensible. I will try to make so. (Tested but did not work)
df_neu = df.copy() for i in df_neu.country.unique(): if i == "Virgin Islands": median_v = df_neu.loc[:,'funding_pro_lender'].median() df_neu.loc[df_neu.country == i,:].replace(np.nan, median_v, inplace=True) else: median_v = df.loc[df['country']==i,'funding_pro_lender'].median() df.loc[df.country == i,:].replace(np.nan, median_v, inplace=True)
# Fill NaN values with median
median_v = df.loc[:,'funding_pro_lender'].median()
df.replace(np.nan, median_v, inplace=True)
df.head(5)
| funded_amount | loan_amount | activity | sector | use | country_code | country | currency | term_in_months | lender_count | repayment_interval | borrower_male_count | borrower_female_count | funding_pro_lender | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | PKR | 12 | 12 | irregular | 0 | 1 | 25.000000 |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | PKR | 11 | 14 | irregular | 0 | 2 | 41.071429 |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | INR | 43 | 6 | bullet | 0 | 1 | 25.000000 |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | PKR | 11 | 8 | irregular | 0 | 1 | 25.000000 |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | PKR | 14 | 16 | monthly | 0 | 1 | 25.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 642934 entries, 0 to 646832 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 642934 non-null float64 1 loan_amount 642934 non-null float64 2 activity 642934 non-null object 3 sector 642934 non-null object 4 use 642934 non-null object 5 country_code 642934 non-null object 6 country 642934 non-null object 7 currency 642934 non-null object 8 term_in_months 642934 non-null int16 9 lender_count 642934 non-null int64 10 repayment_interval 642934 non-null object 11 borrower_male_count 642934 non-null int64 12 borrower_female_count 642934 non-null int64 13 funding_pro_lender 642934 non-null float64 dtypes: float64(3), int16(1), int64(3), object(7) memory usage: 86.0+ MB
For storage space optimization
df.dtypes
funded_amount float64 loan_amount float64 activity object sector object use object country_code object country object currency object term_in_months int16 lender_count int64 repayment_interval object borrower_male_count int64 borrower_female_count int64 funding_pro_lender float64 dtype: object
df.memory_usage(deep=True)
Index 22051800 funded_amount 5143472 loan_amount 5143472 activity 44507999 sector 41716861 use 75870145 country_code 37933106 country 42081570 currency 38576040 term_in_months 1285868 lender_count 5143472 repayment_interval 41581393 borrower_male_count 5143472 borrower_female_count 5143472 funding_pro_lender 5143472 dtype: int64
df.memory_usage()
Index 22051800 funded_amount 5143472 loan_amount 5143472 activity 5143472 sector 5143472 use 5143472 country_code 5143472 country 5143472 currency 5143472 term_in_months 1285868 lender_count 5143472 repayment_interval 5143472 borrower_male_count 5143472 borrower_female_count 5143472 funding_pro_lender 5143472 dtype: int64
df_opti = df.copy()
# Downcast floats
df_opti[["funded_amount","loan_amount", "funding_pro_lender"]] = df_opti[["funded_amount", "loan_amount", "funding_pro_lender"]].apply(pd.to_numeric, downcast="float")
df_opti.dtypes
funded_amount float32 loan_amount float32 activity object sector object use object country_code object country object currency object term_in_months int16 lender_count int64 repayment_interval object borrower_male_count int64 borrower_female_count int64 funding_pro_lender float32 dtype: object
# Downcast integers
df_opti[["term_in_months", "lender_count","borrower_male_count", "borrower_female_count"]] = df_opti[["term_in_months", "lender_count","borrower_male_count", "borrower_female_count"]].apply(pd.to_numeric, downcast="integer")
df_opti.dtypes
funded_amount float32 loan_amount float32 activity object sector object use object country_code object country object currency object term_in_months int16 lender_count int16 repayment_interval object borrower_male_count int8 borrower_female_count int8 funding_pro_lender float32 dtype: object
# Check if we can convert objects to category
liste = df_opti.select_dtypes("object").columns
for spalte in liste:
print(spalte, df_opti.loc[:,spalte].nunique())
activity 163 sector 15 use 424912 country_code 87 country 87 currency 67 repayment_interval 4
# Umwandlung in Category Datentyp - mit astype()
df_opti = df_opti.copy()
liste_category = ['activity','sector','country_code', 'country', 'currency', "repayment_interval"]
for spalte in liste_category:
df_opti.loc[:,spalte] = df_opti.loc[:,spalte].astype('category')
# check, ob Umwandlung erfolgreich
df_opti.dtypes
funded_amount float32 loan_amount float32 activity category sector category use object country_code category country category currency category term_in_months int16 lender_count int16 repayment_interval category borrower_male_count int8 borrower_female_count int8 funding_pro_lender float32 dtype: object
df.memory_usage(deep=True)
Index 22051800 funded_amount 5143472 loan_amount 5143472 activity 44507999 sector 41716861 use 76437392 country_code 37933106 country 42081570 currency 38576040 term_in_months 1285868 lender_count 5143472 repayment_interval 41581393 borrower_male_count 5143472 borrower_female_count 5143472 funding_pro_lender 5143472 dtype: int64
df_opti.memory_usage(deep=True)
Index 22051800 funded_amount 2571736 loan_amount 2571736 activity 1301329 sector 644481 use 76437392 country_code 650171 country 650758 currency 649058 term_in_months 1285868 lender_count 1285868 repayment_interval 643362 borrower_male_count 642934 borrower_female_count 642934 funding_pro_lender 2571736 dtype: int64
df_opti.dtypes
funded_amount float32 loan_amount float32 activity category sector category use object country_code category country category currency category term_in_months int16 lender_count int16 repayment_interval category borrower_male_count int8 borrower_female_count int8 funding_pro_lender float32 dtype: object
df_opti.to_pickle('df_after_preprocessing.pkl')